This paper discusses a nonparametric regression model that naturally
generalizes neural network models. The model is based on a finite number
of one-dimensional transformations and can be estimated with a
one-dimensional rate of convergence. The model contains the generalized
additive model with unknown link function as a special case. For this
case, it is shown that the additive components and link function can be
estimated with the optimal rate by a smoothing spline that is the
solution of a penalized least squares criterion.

1. Introduction. This paper presents a general class of nonparametric 
regression models with unknown link functions. The models include neu- 
ral network structures where link functions enter into the model on dif- 
ferent levels. The inputs into the nodes of the net are modeled as sums 
of transformations of lower level inputs. Different approaches to modeling 
the transformations are allowed, including smooth nonparametric functions, 
shape-restricted nonparametric functions and parametric specifications. We 
show that rate optimal estimation in this class of models can be achieved by 
penalized least squares. The proof of the result relies on direct application 
of empirical process theory. 

The approach described in this paper permits a unified treatment of a
large class of models that includes some well-known examples. The proposed
estimation method can be implemented in practice by using smoothing
splines.

The simplest form of our model is a generalized additive model with an
unknown link function. That is,

(1)  $Y = F[m_1(X^1) + \cdots + m_d(X^d)] + U,$

where $X^1, \ldots, X^d$ are one-dimensional components of a $d$-dimensional
covariate vector, $F$ and $m_1, \ldots, m_d$ are unknown functions and $U$ is an
unobserved error variable satisfying $E[U \mid X] = 0$. We first discuss
estimation of this model when all the unknown functions belong to the same
smoothness class. We will show that these functions can be estimated with
$L_2$-rate $n^{-k/(2k+1)}$ if they are $k$-times differentiable. Penalized least
squares estimators with properly chosen penalty functions achieve this
rate. The rate is optimal because it would be optimal if the link function
were known. As a corollary, we will get the result that this rate carries
over to models that assume more structure on $F$ and $m_1, \ldots, m_d$.
Empirical process theory is our main tool for obtaining rate optimality.
See van de Geer [40] for a comprehensive exposition of the use of empirical
process theory in nonparametric estimation. Applying these techniques, it
can be shown relatively directly that the function
$(x_1, \ldots, x_d) \mapsto F[m_1(x_1) + \cdots + m_d(x_d)]$ can be estimated
with rate $n^{-k/(2k+1)}$. The main difficulty is to show that this rate
carries over to the estimation of the functions $F$ and $m_1, \ldots, m_d$.
Clearly, identification of these functions requires normalizing restrictions.

If the link function, F, is known to be the identity function, then (1) is a 
nonparametric additive regression model. This model has been extensively 
studied. Stone [35, 36, 37] and Newey [30] have shown that optimal L2-rates 
can be achieved by piecewise polynomial fits and regression splines. Breiman 
and Friedman [4] and Buja, Hastie and Tibshirani [5] discuss backfitting for 
additive models. Opsomer and Ruppert [34] and Opsomer [33] considered 
pointwise asymptotic distribution theory for backfitting. Mammen, Linton 
and Nielsen [22] introduced smooth backfitting estimates, a modification 
of backfitting that works more reliably in the case of many components 
and irregular design and that allows a complete asymptotic theory. Nielsen 
and Sperlich [31] and Mammen and Park [24, 25] discuss practical
implementation of smooth backfitting. Tjøstheim and Auestad [38], Linton and
Nielsen [21] and Fan, Härdle and Mammen [9] discuss marginal integration
estimators. See Christopeit and Hoderlein [6] for a related approach.
Horowitz, Klemelä and Mammen [13] showed that in an additive model with
a known identity link function, each additive component can be estimated 
with the same pointwise normal asymptotic distribution that it would have 
if the other components were known. Estimation and inference for general- 
ized additive models with known link functions that are not necessarily the 
identity function have been discussed by Hastie and Tibshirani [11], Linton 
and Härdle [20], Linton [19], Kauermann and Opsomer [18], Härdle, Huet,
Mammen, and Sperlich [10], Yu, Park and Mammen [43] and Horowitz and 
Mammen [14]. These models are natural generalizations of generalized linear 
models (Nelder and Wedderburn [29], Wedderburn [41] and McCullagh and 
Nelder [28]). Generalized additive models have been put in a larger model 
framework in Mammen and Nielsen [23]. Generalized additive models with 
unknown link function have been treated in Horowitz [12] and Horowitz 
and Mammen [15]. The latter paper generalizes Ichimura's [16] approach 
for semiparametric single-index models. Coppejans [7] considered a class of 
additive models that is based on Kolmogorov's theorem on representation 
of functions of several variables by functions of one variable. 

In this paper we will discuss the nonparametric regression model 



(2)  $Y = m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}\Bigl(\sum_{l_2=1}^{L_2} m_{l_1,l_2}\Bigl[\cdots \sum_{l_{p-1}=1}^{L_{p-1}} m_{l_1,\ldots,l_{p-1}}\Bigl(\sum_{l_p=1}^{L_p} m_{l_1,\ldots,l_p}(X^{l_1,\ldots,l_p})\Bigr)\Bigr]\Bigr)\Bigr] + U,$

where $m, m_1, \ldots, m_{L_1,\ldots,L_p}$ are unknown functions and
$X^{l_1,\ldots,l_p}$ are one-dimensional elements of a covariate vector $X$,
which may be identical for two different indices $(l_1, \ldots, l_p)$. This
model is a natural generalization of neural networks where all functions
are parametrically specified.
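To make the nesting in (2) concrete, the following minimal sketch evaluates a two-layer special case ($p = 2$, $L_1 = L_2 = 2$) in which one covariate feeds both inner sums, as in the example discussed later in Section 4. The function name `nested_regression` and all component functions are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import math

def nested_regression(x1, x2, x3, m, m1, m2, m11, m12, m21, m22):
    """Evaluate m[m1(m11(x1) + m12(x2)) + m2(m21(x1) + m22(x3))]."""
    node1 = m1(m11(x1) + m12(x2))   # first inner node
    node2 = m2(m21(x1) + m22(x3))   # second inner node; x1 is reused here
    return m(node1 + node2)         # outer link applied to the sum of nodes

# Hypothetical component functions, chosen only for illustration.
value = nested_regression(
    0.5, 0.2, 0.8,
    m=math.tanh, m1=math.sin, m2=math.cos,
    m11=lambda t: t ** 2, m12=lambda t: 2.0 * t,
    m21=math.exp, m22=lambda t: -t,
)
print(value)
```

In a parametric neural network each of these one-dimensional functions would be a fixed activation applied to a scaled input; in model (2) all of them are unknown.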

The remainder of the paper is organized as follows. The next two sections
discuss the generalized additive model (1). Optimal estimation of the
regression function $(x_1, \ldots, x_d) \mapsto F[m_1(x_1) + \cdots + m_d(x_d)]$
is discussed in Section 2. In Section 3 we show that this result implies
that the estimates of the functions $F$ and $m_1, \ldots, m_d$ are rate
optimal. Section 4 discusses rate optimal estimation in model (2). Section 5
considers regression quantiles in models (1) and (2). Section 6 presents the
results of a simulation study that illustrates the finite-sample
performance of our method. Section 7 concludes. The proofs of all results
are in Section 8.



2. Optimal estimation in generalized additive models. In this section we
discuss rate optimal estimation for model (1). We suppose that the response
variables $Y_i$ ($i = 1, \ldots, n$) are given by

(3)  $Y_i = F[m_1(X_i^1) + \cdots + m_d(X_i^d)] + U_i,$

where $X_i^j$ denotes the $j$th component of the covariate vector
$X_i = (X_i^1, \ldots, X_i^d)$ and $X_i$ may be fixed in repeated samples or
random. If the covariates are fixed, we assume that the unobserved random
variables $U_1, \ldots, U_n$ are independently distributed with $E[U_i] = 0$.
If the covariates are random, we assume that $U_1, \ldots, U_n$ are
conditionally independent and that $E[U_i \mid X_i] = 0$.

The functions $F$ and $m_1, \ldots, m_d$ are assumed to belong to a specified
class $\mathcal{M}$. $\mathcal{M}$ can be the class of all functions or it can
incorporate shape restrictions, such as monotonicity, on some components of
$(F, m_1, \ldots, m_d)$.

We estimate $F$ and $m_1, \ldots, m_d$ by penalized least squares. The estimator
$(\hat F, \hat m_1, \ldots, \hat m_d)$ minimizes

(4)  $n^{-1} \sum_{i=1}^n \{Y_i - F[m_1(X_i^1) + \cdots + m_d(X_i^d)]\}^2 + \lambda_n^2 J(F, m_1, \ldots, m_d)$

over $(F, m_1, \ldots, m_d) \in \mathcal{M}$. Here $J(F, m_1, \ldots, m_d)$ is a
penalty term that measures smoothness of order $k$, with $k$ the number of
times the functions $F, m_1, \ldots, m_d$ are differentiable. The choice of $J$
is somewhat delicate because we want $J$ to have the same value for all
choices of $(F, m_1, \ldots, m_d)$ that result in the same function
$(x_1, \ldots, x_d) \mapsto F[m_1(x_1) + \cdots + m_d(x_d)]$. As we discuss
below, this can be achieved by the following choice of $J$:

$J(F, m_1, \ldots, m_d) = J_1^{\nu_1}(F, m_1, \ldots, m_d) + J_2^{\nu_2}(F, m_1, \ldots, m_d),$

$J_1(F, m_1, \ldots, m_d) = T_k(F) \Bigl\{\sum_{j=1}^d [T_1^2(m_j) + T_k^2(m_j)]\Bigr\}^{(2k-1)/4},$

$J_2(F, m_1, \ldots, m_d) = T_1(F) \Bigl\{\sum_{j=1}^d [T_1^2(m_j) + T_k^2(m_j)]\Bigr\}^{1/4},$

with constants $\nu_1, \nu_2 > 0$ that satisfy $\nu_2 \ge \nu_1$, and

$T_l^2(f) = \int f^{(l)}(z)^2 \, dz$

for $0 \le l \le k$ and any integrable function $f$. The (possibly random)
sequence $(\lambda_n : n = 1, 2, \ldots)$ satisfies conditions that are given
in assumption (A5) below. We conjecture that the performance of the
estimator does not strongly depend on the choices of the constants
$\nu_1, \nu_2$, but we allow here this additional flexibility because a certain
choice may simplify the numeric calculation of the estimator.

In fact, the theory that follows does not require
$(\hat F, \hat m_1, \ldots, \hat m_d)$ to really minimize (4). It suffices for
(4) to differ from its minimum by a term whose size is at most of order
$O_P(n^{-2k/(2k+1)})$. In what follows, we will assume that the estimate is
chosen so that this holds. This also simplifies the numerical
implementation of the estimator. We return to this point below. We call the
resulting estimates approximate minimizers of (4).

Further normalizing assumptions are needed to identify the functions
$(F, m_1, \ldots, m_d)$ in (3). To see this, let $a > 0$ and
$\beta = (\beta_1, \ldots, \beta_d) \in \mathbb{R}^d$ be constants. Define

(5)  $F_{a,\beta}(x) = F[a(x + \beta_1 + \cdots + \beta_d)]$

and

(6)  $m_{j,a,\beta}(x_j) = a^{-1} m_j(x_j) - \beta_j$

for $j = 1, \ldots, d$. Then

(7)  $F_{a,\beta}[m_{1,a,\beta}(x_1) + \cdots + m_{d,a,\beta}(x_d)] = F[m_1(x_1) + \cdots + m_d(x_d)].$

Thus, the regression function (conditional mean function of $Y$) is the same
for all choices of $a > 0$ and $\beta \in \mathbb{R}^d$. In fact, for a given
regression function $H(x) = F[m_1(x_1) + \cdots + m_d(x_d)]$ and under mild
regularity conditions, the functions $F$ and $m_1, \ldots, m_d$ are identified
up to transformations that correspond to different choices of $a > 0$ and
$\beta \in \mathbb{R}^d$. The penalty functionals $J_1$ and $J_2$ are chosen
such that they do not depend on the special choice of $a$ and $\beta$.
That is,

(8)  $J_1(F_{a,\beta}, m_{1,a,\beta}, \ldots, m_{d,a,\beta}) = J_1(F, m_1, \ldots, m_d)$

and

(9)  $J_2(F_{a,\beta}, m_{1,a,\beta}, \ldots, m_{d,a,\beta}) = J_2(F, m_1, \ldots, m_d)$

for all $a > 0$ and $\beta \in \mathbb{R}^d$. Therefore, the penalty functionals
depend only on the regression function $H(x)$. We will assume that
$\mathcal{M}$ is closed under the transformations (5) and (6). See assumption
(A3). Then without loss of generality we can assume that
$\sum_{j=1}^d [T_1^2(m_j) + T_k^2(m_j)] = 1$, and the penalized least squares
estimator $(\hat F, \hat m_1, \ldots, \hat m_d)$ can be defined as the
minimizer of

(10)  $n^{-1} \sum_{i=1}^n \{Y_i - F[m_1(X_i^1) + \cdots + m_d(X_i^d)]\}^2 + \lambda_n^2 \Bigl\{\Bigl[\int F^{(k)}(z)^2 \, dz\Bigr]^{\nu_1/2} + \Bigl[\int F'(z)^2 \, dz\Bigr]^{\nu_2/2}\Bigr\}$

over all $(F, m_1, \ldots, m_d) \in \mathcal{M}$ with

$\sum_{j=1}^d \Bigl[\int m_j^{(k)}(x_j)^2 \, dx_j + \int m_j'(x_j)^2 \, dx_j\Bigr] = 1.$

This norming simplifies the notation when we move to general neural network
models in Section 4. Other scalings are also possible, and we will use
another normalization when we discuss estimation of the additive components
and of the link function in Section 3; see (A9) below.
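The invariance statements (7), (8) and (9) can be checked numerically. The sketch below is only an illustration under stated assumptions: the link `np.tanh`, the two components, the constants `a` and `beta`, the grids and the finite-difference approximation of the derivatives are all hypothetical choices, and the integral over the real line in $T_l^2(F)$ is replaced by a Riemann sum on a wide finite grid.

```python
import numpy as np

def t_sq(values, grid, order):
    """Riemann-sum approximation of T_l^2(f), the integral of (f^(l))^2."""
    for _ in range(order):
        values = np.gradient(values, grid, edge_order=2)  # finite differences
    return float(np.sum(values ** 2) * (grid[1] - grid[0]))

def penalties(F_vals, z, m_vals, x, k):
    """Return (J_1, J_2) for a link tabulated on z and components tabulated on x."""
    s = sum(t_sq(m, x, 1) + t_sq(m, x, k) for m in m_vals)
    j1 = np.sqrt(t_sq(F_vals, z, k)) * s ** ((2 * k - 1) / 4)
    j2 = np.sqrt(t_sq(F_vals, z, 1)) * s ** 0.25
    return j1, j2

k = 2
a, beta = 2.0, np.array([0.3, -0.1])     # hypothetical rescaling constants
z = np.linspace(-40.0, 40.0, 80001)      # wide grid standing in for the real line
x = np.linspace(0.0, 1.0, 2001)

F = np.tanh                              # illustrative link function
m = [np.sin(np.pi * x), x ** 3]          # illustrative additive components

F_ab = F(a * (z + beta.sum()))                    # transformation (5) on the z grid
m_ab = [mj / a - bj for mj, bj in zip(m, beta)]   # transformation (6)

# Regression function before and after the transformation, as in (7).
h_old = F(m[0] + m[1])
h_new = F(a * (m_ab[0] + m_ab[1] + beta.sum()))

j_old = penalties(F(z), z, m, x, k)
j_new = penalties(F_ab, z, m_ab, x, k)
print(j_old, j_new)
```

Up to discretization error the two pairs of penalty values agree, which reflects that the exponents $(2k-1)/4$ and $1/4$ exactly offset the factors $a^{k-1/2}$ and $a^{1/2}$ picked up by $T_k(F_{a,\beta})$ and $T_1(F_{a,\beta})$.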

The penalty functionals $J_1$ and $J_2$ contain the $L_2$ norms of the first and
$k$th derivatives of $F$ and $m_1, \ldots, m_d$. It can be seen easily that a
penalty containing only the $k$th derivatives of these functions will not
work here. Consider the extreme case in which $F$ is a linear function. Then
$T_k(F) = 0$ and $T_k(m_j)$ can be made arbitrarily small by using the
transformations (5) and (6). On the other hand, if $m_1, \ldots, m_d$ are
linear functions, then $T_k(m_j) = 0$ for $1 \le j \le d$ and $T_k(F)$ can be
made arbitrarily small by using the transformations. Therefore, a penalty
that depends only on $T_k(F)$ and $T_k(m_1), \ldots, T_k(m_d)$ cannot work
because it puts zero penalty on the semiparametric specification in which
$F$ or the $m_j$'s are linear.

Our first result states that the regression function
$H(x) = F[m_1(x_1) + \cdots + m_d(x_d)]$ can be estimated with rate
$n^{-k/(2k+1)}$. This rate is optimal for model (3) with a known link function
and unknown additive components under the assumption that the additive
components are $k$ times differentiable. Clearly, model (3) is more general,
because the link function is unknown. Therefore, this rate is also optimal
for (3), and our approach provides a rate optimal estimator.

The rate optimality result needs the following assumptions. 

(A1) The covariates $X_i^1, \ldots, X_i^d$ may be fixed in repeated samples or
random and take values in a compact subset of $\mathbb{R}$ that, without loss
of generality, we take to be $[0, 1]$. The random variables
$U_1, \ldots, U_n$ are independent if the covariates are fixed. If the
covariates are random, then $U_1, \ldots, U_n$ are conditionally independent
given $X_1, \ldots, X_n$.

(A2) The functions $F$ and $m_1, \ldots, m_d$ have $k$ derivatives. Moreover,

$\int F^{(k)}(z)^2 \, dz < +\infty, \qquad \int m_j^{(k)}(x)^2 \, dx < +\infty$

for $j = 1, \ldots, d$. Furthermore, $(F, m_1, \ldots, m_d) \in \mathcal{M}$.

(A3) For all $a > 0$ and $\beta \in \mathbb{R}^d$, if
$(G, \mu_1, \ldots, \mu_d) \in \mathcal{M}$, then
$(G_{a,\beta}, \mu_{1,a,\beta}, \ldots, \mu_{d,a,\beta}) \in \mathcal{M}$. [For a
definition of $G_{a,\beta}, \mu_{1,a,\beta}, \ldots, \mu_{d,a,\beta}$, see (5) and
(6).]

(A4) The (conditional) distribution of $U_i$ ($i = 1, \ldots, n$) has
subexponential tails. That is, there are constants $t_U, c_U > 0$ such that

$\sup_{1 \le i \le n} E[\exp(t|U_i|) \mid X_1, \ldots, X_n] < c_U$

almost surely for $|t| < t_U$. Moreover, $E[U_i \mid X_1, \ldots, X_n] = 0$ for
each $i = 1, \ldots, n$ if the covariates are random, and $E[U_i] = 0$ for each
$i = 1, \ldots, n$ if the covariates are fixed in repeated samples.

(A5) $\lambda_n^{-1} = O_P(n^{k/(2k+1)})$ and $\lambda_n = O_P(n^{-k/(2k+1)})$.

These conditions are standard and very weak. In (A1) we assume that the
covariates have a compact support to avoid the need of smoothing estimates
in the tails of the distribution of $X$. Moreover, a poor rate of convergence
for an estimator of one component in the tails could affect the estimator of
another component in the center of the distribution of $X$. The (conditional)
independence of the $U_i$'s can be weakened to permit martingale difference
or mixing sequences of dependent variables. This would complicate the
technical analysis and produce a less transparent treatment. Assumption (A2)
can be generalized to permit a model that increases with increasing sample
size. Again, this would make the theory less transparent and would lead to
an estimation procedure in which the sieve model and penalty factors
$\lambda_n$ have to be chosen data-adaptively. Assumption (A3) entails no loss
of generality, because $\mathcal{M}$ can always be enlarged to make (A3) hold.
Assumption (A4) enables us to use the exponential inequalities needed in
empirical process theory. Assumption (A5) allows the possibility that
$\lambda_n$ is random. This includes the important case of a data-adaptive
choice of $\lambda_n$.
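As a small numerical illustration of the order required in (A5), the sketch below evaluates $n^{-k/(2k+1)}$ for a given sample size and smoothness. The helper name `l2_rate` is hypothetical, and any proportionality constant or data-adaptive rule for $\lambda_n$ is left unspecified, as it is in (A5).

```python
def l2_rate(n, k):
    """Order n^{-k/(2k+1)} that appears in (A5) and in the L2 rate of the estimator."""
    return n ** (-k / (2 * k + 1))

# Sample sizes used in Section 6, with k = 2 as an illustrative smoothness.
for n in (400, 900):
    print(n, l2_rate(n, k=2))
```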

We are now ready to state our first result on rate optimality of our esti- 
mator. 

Theorem 2.1. Let (A1)-(A5) hold with $k \ge 2$. Then

(11)  $n^{-1} \sum_{i=1}^n \{\hat F[\hat m_1(X_i^1) + \cdots + \hat m_d(X_i^d)] - F[m_1(X_i^1) + \cdots + m_d(X_i^d)]\}^2 = O_P(n^{-2k/(2k+1)})$

and

(12)  $J(\hat F, \hat m_1, \ldots, \hat m_d) = O_P(1).$

We now state a corollary of Theorem 2.1 for random covariates that
satisfy:

(A6) The covariates $X_1, \ldots, X_n$ are independently and identically
distributed with distribution $P$.

Theorem 2.2. Let (A1)-(A6) hold with $k \ge 2$. Then

(13)  $\int \{\hat F[\hat m_1(x_1) + \cdots + \hat m_d(x_d)] - F[m_1(x_1) + \cdots + m_d(x_d)]\}^2 P(dx) = O_P(n^{-2k/(2k+1)})$

and $J(\hat F, \hat m_1, \ldots, \hat m_d) = O_P(1)$.

Up to this point, we have assumed that the penalty factor $\lambda_n$ is the
same for all components of $(F, m_1, \ldots, m_d)$. This has been done to
simplify the notation. In practice, we can choose a different penalty factor
for each component function. To do this, we introduce random factors
$\rho_{n,0}, \ldots, \rho_{n,d}$ and modify the penalty functionals $J_1$ and
$J_2$ to

$J_1(F, m_1, \ldots, m_d) = \rho_{n,0} T_k(F) \Bigl\{\sum_{j=1}^d [T_1^2(m_j) + \rho_{n,j}^2 T_k^2(m_j)]\Bigr\}^{(2k-1)/4}$

and

$J_2(F, m_1, \ldots, m_d) = T_1(F) \Bigl\{\sum_{j=1}^d [T_1^2(m_j) + \rho_{n,j}^2 T_k^2(m_j)]\Bigr\}^{1/4}.$

Then Theorems 2.1 and 2.2 hold if $\rho_{n,0}, \ldots, \rho_{n,d} = O_P(1)$ and
$\rho_{n,0}^{-1}, \ldots, \rho_{n,d}^{-1} = O_P(1)$.



In this paper we only consider $L_2$ losses. The discussion for sup-norm
losses is quite different. Optimal rates differ by different powers of $n$
and not only by a log-term. This can be seen by the construction in the
first part of the proof of Theorem 1 in Juditsky, Lepski and Tsybakov [17],
which implies that for $d = 2$ and $F$ with $\gamma$ bounded derivatives and
$m_1, m_2$ with $\beta$ bounded derivatives, up to a logarithmic factor, the
order of the optimal rate for sup-norm losses is not faster than
$n^{-\gamma/(2\gamma+1+1/\beta)}$. For $\beta = \gamma = 2$, this rate is slower
than $n^{-2/5}$. Only if one assumes one more degree of smoothness for $F$
($\gamma = 3$) does the rate coincide with the optimal $L_2$ rate for
$\beta = \gamma = 2$. The basic idea of the construction in Juditsky, Lepski
and Tsybakov [17] is to consider testing problems with functions $F$ and
$m_2$ both depending on $n$ with shrinking support around zero but with fixed
$m_1(x_1) = x_1$. Then for estimating $m_1$ and $m_2$ for $x_1 = x_2 = 0$, only
observations $(X_i^1, X_i^2)$ from a local neighborhood around $(0, 0)$ can be
used. In Horowitz and Mammen [15] we study pointwise asymptotics of a kernel
smoother in an additive model with unknown link under smoothness
assumptions $\beta = 2$, $\gamma = 3$ and we show that the pointwise rate
$n^{-2/5}$ is achieved.
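The comparison of exponents in the preceding paragraph is simple arithmetic and can be reproduced directly. The helper names below are hypothetical; the formulas are the ones quoted in the text.

```python
def sup_norm_exponent(gamma, beta):
    """Exponent r of the sup-norm bound n^{-r}, with r = gamma / (2*gamma + 1 + 1/beta)."""
    return gamma / (2 * gamma + 1 + 1 / beta)

def l2_exponent(k):
    """Exponent k / (2k + 1) of the optimal L2 rate n^{-k/(2k+1)}."""
    return k / (2 * k + 1)

print(sup_norm_exponent(2, 2), l2_exponent(2))  # sup-norm exponent is the smaller one
print(sup_norm_exponent(3, 2), l2_exponent(2))  # the two exponents coincide
```

For $\beta = \gamma = 2$ the sup-norm exponent is $2/5.5 = 4/11$, which is smaller than $2/5$; for $\gamma = 3$, $\beta = 2$ it is $3/7.5 = 2/5$.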

3. Optimal estimation of the additive components and link function of
a generalized additive model. Section 2 discussed how well our penalized
least squares procedure estimates the conditional mean function, $H(x)$. We
now discuss the asymptotic performance of the estimators of the additive
components and link function. We make the following additional
assumptions.

(A7) The covariates $(X^1, \ldots, X^d)$ have a probability density function
$f$ that is bounded away from 0 and $\infty$.

(A8) $F'(z)$ is bounded away from 0 for
$z \in \{m_1(x_1) + \cdots + m_d(x_d) : 0 \le x_1, \ldots, x_d \le 1\}$. The
additive components $m_j$ are nonconstant for at least two values of $j$
($1 \le j \le d$).

(A9) The functions $m_1, \ldots, m_d$ and $F$ and their estimates
$\hat m_1, \ldots, \hat m_d$ and $\hat F$ are chosen such that



P (1). 




for j = 1, . . . ,d and 




1 
These are mild conditions. Condition (A7) implies that the $L_2$ norms with
respect to the density $f$ and Lebesgue measure are equivalent. This
technical point is used in the proof of Theorem 3.2. The assumption that the
link function is monotonic is used for identification. All common choices of
link functions have this property. The assumption that two additive
components are nonconstant is needed for identification. If there were only
one nonconstant additive component, say, $m_1$, then it would follow
trivially that $F(m_1 + \text{const.})$ does not identify $F$ and $m_1$.
Condition (A9) can be always achieved because of (A3) and (A8): Condition
(A8) excludes the case that all functions $m_1, \ldots, m_d$ are constant and
because of (A3) all functions in $\mathcal{M}$ can be transformed by (5) and
(6), at least if not all additive components are constant. Conditions (A8)
and (A9) identify the functions $m_1, \ldots, m_d$ and $F$. This can be seen by
a simple argument. We state this in the following proposition.

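Proposition 3.1. For continuously differentiable functions
$F : \mathbb{R} \to \mathbb{R}$, $m_1 : A_1 \to \mathbb{R}, \ldots, m_d : A_d \to \mathbb{R}$
and $G : \mathbb{R} \to \mathbb{R}$,
$\mu_1 : A_1 \to \mathbb{R}, \ldots, \mu_d : A_d \to \mathbb{R}$ with intervals
$A_1, \ldots, A_d \subset \mathbb{R}$, we assume that the functions $m_j$ are
nonconstant for at least two values of $j$ ($1 \le j \le d$), $F'(z) > 0$ for
$z \in \mathbb{R}$,

$F[m_1(x_1) + \cdots + m_d(x_d)] = G[\mu_1(x_1) + \cdots + \mu_d(x_d)]$

for $x_j \in A_j$, $1 \le j \le d$,

$\int_{A_j} m_j(x_j) \, dx_j = 0, \qquad \int_{A_j} \mu_j(x_j) \, dx_j = 0$

for $1 \le j \le d$, and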

d 



j=l J A i J=1 J j 



1. 



Then

$m_j(x_j) = \mu_j(x_j)$

for $x_j \in A_j$, $1 \le j \le d$, and

$F(z) = G(z)$

for $z \in \{m_1(x_1) + \cdots + m_d(x_d) : x_1 \in A_1, \ldots, x_d \in A_d\}$.

We now state rate-optimality of our estimates of $m_1, \ldots, m_d$ and $F$.

Theorem 3.2. Let (A1)-(A9) hold with $k \ge 2$. Then

(14)  $\int_0^1 [\hat m_j(x_j) - m_j(x_j)]^2 \, dx_j = O_P(n^{-2k/(2k+1)})$

and

(15)  $\int \Bigl\{\hat F\Bigl[\sum_{j=1}^d m_j(x_j)\Bigr] - F\Bigl[\sum_{j=1}^d m_j(x_j)\Bigr]\Bigr\}^2 dx = O_P(n^{-2k/(2k+1)}).$



We now briefly discuss numerical computation of the estimates. We will
do this for two approaches. The first is based on B-splines, the second
one on smoothing splines. Our estimates are not fully specified because
we require only that the penalized least squares objective function be
approximately minimized. This leaves some freedom to choose estimates that
are best suited to computation. The approach based on B-splines will be
used in the simulations below. In this approach we minimize (4) over
B-splines $m_1, \ldots, m_d$ and $F$. If the B-splines are of order $k$ and if
they use $O(n^{1/(2k+1)})$ knot points, then functions $m_1, \ldots, m_d$ and $F$
that satisfy $T_k(m_1) = O(1), \ldots, T_k(m_d) = O(1)$ and $T_k(F) = O(1)$ can
be approximated with an $L_2$ error that is of order $O(n^{-k/(2k+1)})$. This
implies that the derivative of $F$ is in supnorm approximated with order
$o(1)$ and, thus, $F[m_1(x_1) + \cdots + m_d(x_d)]$ is approximated with order
$O(n^{-k/(2k+1)})$. Thus, the minimizer of (4) over B-splines
$m_1, \ldots, m_d$ and $F$ is an approximate minimizer of (4), as defined in
the discussion after (4). The B-spline estimator can be calculated by a
backfitting algorithm that alternates between two steps. In one step, $F$ is
held fixed at its current value, and a quadratic approximation to the
objective function considered as a function of the Fourier coefficients of
$m$ is optimized. In the second step, $\hat m$ is held fixed at the value
found in the first step, and a new value of $F$ is obtained by optimizing
the objective function over the Fourier coefficients of $F$. The first step
is an equality-constrained quadratic programming problem that can be solved
by the method of Lagrangian multipliers. The second step is an
unconstrained quadratic programming problem that can be solved analytically.
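The second step of the alternation can be sketched as a ridge-type linear solve. The code below is a simplified illustration under several assumptions that are not taken from the paper: a monomial basis replaces the B-spline basis, the constants are set to $\nu_1 = \nu_2 = 2$ so that, under the normalization used in (10), the penalty on $F$ reduces to the quadratic form $T_k^2(F) + T_1^2(F)$, and the penalty integrals are taken over the observed range of the index and approximated by a Riemann sum. The names `monomial_derivative` and `update_link` are hypothetical. The first step (the constrained update of the additive components) is not shown.

```python
import math
import numpy as np

def monomial_derivative(grid, degree, order):
    """Columns hold the order-th derivatives of 1, z, ..., z^degree on grid."""
    cols = []
    for r in range(degree + 1):
        if r < order:
            cols.append(np.zeros_like(grid))
        else:
            factor = math.factorial(r) / math.factorial(r - order)
            cols.append(factor * grid ** (r - order))
    return np.column_stack(cols)

def update_link(z, y, lam, degree=3, k=2, n_grid=2001):
    """Minimize n^{-1} sum (y_i - F(z_i))^2 + lam^2 [T_k^2(F) + T_1^2(F)]
    over polynomials F of the given degree, with the index values z_i held fixed.
    Returns the coefficients of F in increasing powers of z."""
    n = len(y)
    B = np.vander(z, degree + 1, increasing=True)     # basis evaluated at the index
    grid = np.linspace(z.min(), z.max(), n_grid)
    dz = grid[1] - grid[0]
    penalty = np.zeros((degree + 1, degree + 1))
    for order in (1, k):
        D = monomial_derivative(grid, degree, order)
        penalty += D.T @ D * dz                       # Riemann sum of the penalty integral
    lhs = B.T @ B / n + lam ** 2 * penalty            # quadratic problem, closed form
    rhs = B.T @ y / n
    return np.linalg.solve(lhs, rhs)

# Illustrative data: a fixed index z and a noiseless quadratic link.
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, 200)
y = 1.0 + 2.0 * z - z ** 2
coef_free = update_link(z, y, lam=0.0)    # no penalty: ordinary least squares
coef_heavy = update_link(z, y, lam=1e3)   # heavy penalty: nearly constant fit
```

With `lam=0.0` the solve reduces to ordinary least squares on the basis, and with a very large `lam` the fitted link is pushed toward a constant because the first-derivative term penalizes every nonconstant polynomial.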

The second approach is based on smoothing splines. We will discuss this
under the additional assumption that the class $\mathcal{M}$ does not restrict
$F$ or one additive component. Condition (A10) makes an assumption for a
$j_0$ with $1 \le j_0 \le d$.

(A10) For each $(G, \mu_1, \ldots, \mu_d) \in \mathcal{M}$,
$(G, \mu_1, \ldots, \mu_{j_0-1}, \mu^*_{j_0}, \mu_{j_0+1}, \ldots, \mu_d) \in \mathcal{M}$
for any function $\mu^*_{j_0} : [0, 1] \to \mathbb{R}$.

(A11) For each $(G, \mu_1, \ldots, \mu_d) \in \mathcal{M}$,
$(G^*, \mu_1, \ldots, \mu_d) \in \mathcal{M}$ for any function
$G^* : \mathbb{R} \to \mathbb{R}$.

Theorem 3.3. Let (A1)-(A8) hold with $k \ge 2$.

(i) Let (A10) hold for a $j_0$ with $1 \le j_0 \le d$. Suppose
$(\hat F, \hat m_1, \ldots, \hat m_d)$ is an approximate minimizer of (4). Let
$\tilde m_{j_0}$ be chosen among natural splines $m_{j_0}$ of order $2k$ with
knots $X_1^{j_0}, \ldots, X_n^{j_0}$ so that
$(\hat F, \hat m_1, \ldots, \hat m_{j_0-1}, \tilde m_{j_0}, \hat m_{j_0+1}, \ldots, \hat m_d)$
minimizes (4) among
$(\hat F, \hat m_1, \ldots, \hat m_{j_0-1}, m_{j_0}, \hat m_{j_0+1}, \ldots, \hat m_d)$.
Then
$(\hat F, \hat m_1, \ldots, \hat m_{j_0-1}, \tilde m_{j_0}, \hat m_{j_0+1}, \ldots, \hat m_d)$
is also an approximate minimizer of (4) and, therefore, has the properties
stated in Theorems 2.1, 2.2 and 3.2.

(ii) Let (A11) hold. Suppose $(\hat F, \hat m_1, \ldots, \hat m_d)$ is an
approximate minimizer of (4). Let $\tilde F$ be chosen among natural splines
$F$ of order $2k$ with knots
$\hat m_1(X_1^1) + \cdots + \hat m_d(X_1^d), \ldots, \hat m_1(X_n^1) + \cdots + \hat m_d(X_n^d)$
so that $(\tilde F, \hat m_1, \ldots, \hat m_d)$ minimizes (4) among
$(F, \hat m_1, \ldots, \hat m_d)$. Then $(\tilde F, \hat m_1, \ldots, \hat m_d)$ is
also an approximate minimizer of (4) and, therefore, has the properties
stated in Theorems 2.1, 2.2 and 3.2.

Natural splines of order 2k with knots at the design points arise as mini- 
mizers of a penalized least squares criterion for the classical nonparametric 
regression problem with a one-dimensional regression function and are also 
called smoothing splines. See, for example, Eubank [8]. 

We now discuss application of Theorem 3.3 for the case that $\mathcal{M}$
contains all functions. Then (A11) holds and (A10) holds for all
$1 \le j_0 \le d$. Therefore, repeated application of Theorem 3.3 implies that
all estimates, $\hat F$ and $\hat m_1, \ldots, \hat m_d$, can be chosen as
natural splines. The computation of the estimates could be done by
application of a backfitting algorithm. In each step of the algorithm one
estimate ($\hat F$, $\hat m_1, \ldots,$ or $\hat m_d$, resp.) would be updated.
This could be done by using standard smoothing spline software. In the
update of $\hat m_1, \ldots, \hat m_d$ the minimization could be approximately
solved by linearization.

4. Estimation of nonparametric neural network regression. In this section
we discuss rate optimal estimation of the nonparametric neural network
model (2). We assume that the response variables $Y_i$ are given by

(16)  $Y_i = m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}\Bigl(\sum_{l_2=1}^{L_2} m_{l_1,l_2}\Bigl[\cdots \sum_{l_p=1}^{L_p} m_{l_1,\ldots,l_p}(X_i^{l_1,\ldots,l_p}) \cdots\Bigr]\Bigr)\Bigr] + U_i,$

where the covariate vector
$X_i = (X_i^{l_1,\ldots,l_p} : 1 \le l_j \le L_j, 1 \le j \le p)$ may be fixed in
repeated samples or random. If the covariates are fixed, we assume that the
unobserved random variables $U_1, \ldots, U_n$ are independently distributed
with $E[U_i] = 0$. If the covariates are random, we assume that the random
variables $U_1, \ldots, U_n$ are conditionally independent and that
$E[U_i \mid X_i] = 0$. The functions $(m, m_1, \ldots, m_{L_1,\ldots,L_p})$ are
assumed to be contained in a specified class $\mathcal{M}$.

We estimate $(m, m_1, \ldots, m_{L_1,\ldots,L_p})$ by penalized least squares.
The penalized least squares estimator
$\hat m, \hat m_1, \ldots, \hat m_{L_1,\ldots,L_p}$ minimizes

(17)  $n^{-1} \sum_{i=1}^n \Bigl\{Y_i - m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}\Bigl(\sum_{l_2=1}^{L_2} m_{l_1,l_2}\Bigl[\cdots \sum_{l_p=1}^{L_p} m_{l_1,\ldots,l_p}(X_i^{l_1,\ldots,l_p}) \cdots\Bigr]\Bigr)\Bigr]\Bigr\}^2 + \lambda_n^2 J(m)$

over $(m, m_1, \ldots, m_{L_1,\ldots,L_p}) \in \mathcal{M}$ with

$J(m) = [T_1^2(m) + c\,T_k^2(m)]^{\nu},$

$\sum_{l_1=1}^{L_1} [T_1^2(m_{l_1}) + c_{l_1} T_k^2(m_{l_1})] = \cdots = \sum_{l_p=1}^{L_p} [T_1^2(m_{L_1,\ldots,L_{p-1},l_p}) + c_{L_1,\ldots,L_{p-1},l_p} T_k^2(m_{L_1,\ldots,L_{p-1},l_p})] = 1$

and $\nu, c, c_1, \ldots, c_{L_1,\ldots,L_p} > 0$ constants. It suffices that (17)
differs from its minimum by a term that is $O_P(n^{-2k/(2k+1)})$. In what
follows, we assume that the estimate is chosen so that this holds.

Our first result states that the regression function $m$ can be estimated
with rate $n^{-k/(2k+1)}$, which is optimal for model (16).

Theorem 4.1. Let (A1)-(A5) hold with $k \ge 2$, $X_i^1, \ldots, X_i^d$ replaced
by $X_i^{1,\ldots,1}, \ldots, X_i^{L_1,\ldots,L_p}$ and $F, m_1, \ldots, m_d$
replaced by $m, m_1, \ldots, m_{L_1,\ldots,L_p}$. Then

(18)  $n^{-1} \sum_{i=1}^n \Bigl\{\hat m\Bigl[\sum_{l_1=1}^{L_1} \hat m_{l_1}[\cdots \hat m_{l_1,\ldots,l_p}(X_i^{l_1,\ldots,l_p})]\Bigr] - m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}[\cdots m_{l_1,\ldots,l_p}(X_i^{l_1,\ldots,l_p})]\Bigr]\Bigr\}^2 = O_P(n^{-2k/(2k+1)})$

and

(19)  $J(\hat m) = O_P(1).$

We now state a corollary of Theorem 4.1 for the case of random covariates.

Theorem 4.2. Let (A1)-(A6) hold with $k \ge 2$, random covariates
$X_i^1, \ldots, X_i^d$ replaced by $X_i^{1,\ldots,1}, \ldots, X_i^{L_1,\ldots,L_p}$,
and $F, m_1, \ldots, m_d$ replaced by $m, m_1, \ldots, m_{L_1,\ldots,L_p}$. Then

(20)  $\int \Bigl\{\hat m\Bigl[\sum_{l_1=1}^{L_1} \hat m_{l_1}[\cdots \hat m_{l_1,\ldots,l_p}(x^{l_1,\ldots,l_p})]\Bigr] - m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}[\cdots m_{l_1,\ldots,l_p}(x^{l_1,\ldots,l_p})]\Bigr]\Bigr\}^2 P(dx) = O_P(n^{-2k/(2k+1)}),$

where $P$ is the distribution of $X_i$. Furthermore, $J(\hat m) = O_P(1)$.

We conjecture that all functional components can be estimated with the
optimal rate $O_P(n^{-k/(2k+1)})$ if (A7) and (A9) hold and
$m, \ldots, m_{L_1,\ldots,L_{p-1}}$ have derivatives that are bounded away from
0 and, for all values of $1 \le l_1 \le L_1, \ldots, 1 \le l_{p-1} \le L_{p-1}$, at
least two functions $m_{l_1,\ldots,l_p} : 1 \le l_p \le L_p$ are nonconstant.
This would be a result that is analogous to Theorem 3.2. Such a result would
be less important for neural networks than for generalized additive models.
This is because in neural networks one would like to permit two elements of
$X$ to be identical, which violates (A7). For example, suppose the
regression function is

$m[m_1[m_{1,1}(x_1) + m_{1,2}(x_2)] + m_2[m_{2,1}(x_1) + m_{2,2}(x_3)]].$

Arguing as in the proof of Theorem 3.2, one could consistently estimate the
partial derivatives $g_1 = m' m_1' m_{1,1}' + m' m_2' m_{2,1}'$,
$g_2 = m' m_1' m_{1,2}'$ and $g_3 = m' m_2' m_{2,2}'$. By backfitting, one could
fit two functions $h_2(x_1, x_2)$ and $h_3(x_1, x_3)$ such that
$g_1 \approx g_2 h_2 + g_3 h_3$. This would result in estimates of
$m_{1,1}'/m_{1,2}'$ and $m_{2,1}'/m_{2,2}'$. Solving, again by backfitting,
$\log h_2(x_1, x_2) = h_{2,1}(x_1) + h_{2,2}(x_2)$ and
$\log h_3(x_1, x_3) = h_{3,1}(x_1) + h_{3,3}(x_3)$ would give consistent
estimates of $m_{1,1}'$, $m_{2,1}'$, $m_{1,2}'$ and $m_{2,2}'$. It is clear that
it is very hard to establish the conditions under which this approach would
result in a consistent estimate. It would be even harder to show that this
approach can be used to get rate optimal estimates of the functions
$m, m_1, m_2, m_{1,1}, m_{1,2}, m_{2,1}$ and $m_{2,2}$.

5. Regression quantiles. The estimation approach of this paper can be
extended to M-functionals other than least squares. In this section we will
discuss quantile estimation. We consider again model (1) or (16), but now
we choose $0 < \alpha < 1$ and we assume that the (conditional)
$\alpha$-quantile of $U_i$ is equal to 0 (and not the conditional mean). We
define $u_\alpha(z) = \alpha z - z I[z < 0]$, where $I$ is the indicator
function. Define penalized regression quantiles as the functions that
minimize [up to a term of order $O_P(n^{-2k/(2k+1)})$]

(21)  $n^{-1} \sum_{i=1}^n u_\alpha\{Y_i - F[m_1(X_i^1) + \cdots + m_d(X_i^d)]\} + \lambda_n^2 J(F, m_1, \ldots, m_d)$

or

(22)  $n^{-1} \sum_{i=1}^n u_\alpha\Bigl\{Y_i - m\Bigl[\sum_{l_1=1}^{L_1} m_{l_1}(\cdots)\Bigr]\Bigr\} + \lambda_n^2 J(m).$

The penalty terms are as defined in Sections 2 and 4. Make the following
assumption.

(A4') The function $E[u_\alpha(U_i - \mu) \mid X_1, \ldots, X_n]$ almost surely
has a unique minimum at $\mu = 0$. Furthermore, for some $\varepsilon > 0$ and
all $0 < a < \varepsilon$, it holds that

$\inf_{1 \le i \le n} P(0 < U_i < a) > \varepsilon a$

and

$\inf_{1 \le i \le n} P(-a < U_i < 0) > \varepsilon a$

almost surely.

Table 1
Performance of $\hat m_1$, $\hat m_2$ and $\hat F$ for different values of $n$ and $\lambda$

  $n$    $\lambda$   $\hat m_1$   $\hat m_2$   $\hat F$
  400    0.05        0.030        0.029        0.0040
         0.10        0.029        0.024        0.0039
         0.15        0.026        0.029        0.0048
  900    0.05        0.023        0.018        0.0030
         0.10        0.017        0.015        0.0027
         0.15        0.025        0.017        0.0036

Theorem 5.1. Let the conditions of Theorem 2.1, Theorem 2.2, Theorem 3.2,
Theorem 4.1 or Theorem 4.2 hold with (A4') in place of (A4). Then the
conclusions of the corresponding theorem hold for the estimators defined in
(21) or (22).
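The role of the check function $u_\alpha$ from the start of this section can be illustrated without any smoothing: for a constant fit, minimizing the average check loss picks out a sample $\alpha$-quantile. The sketch below is an illustration only; the helper names and the toy data are hypothetical, and the search is restricted to the sample points.

```python
import numpy as np

def check_loss(z, alpha):
    """u_alpha(z) = alpha * z - z * 1[z < 0]."""
    z = np.asarray(z, dtype=float)
    return alpha * z - z * (z < 0)

def best_constant(y, alpha):
    """Return the sample point mu that minimizes the mean check loss of y - mu."""
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - mu, alpha).mean() for mu in y]
    return y[int(np.argmin(losses))]

y = np.arange(1, 102)             # toy sample 1, 2, ..., 101
mu_hat = best_constant(y, 0.25)   # a sample 0.25-quantile of y
print(mu_hat)
```

The penalized criteria (21) and (22) replace the constant by the structured regression function and add the penalty term, but the loss is the same.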

6. Simulation results. We carried out a small simulation study with
$Y = F[m_1(X^1) + m_2(X^2)] + U$, where $F$ is the identity function,
$m_1(x) = \sin(\pi x)$, $m_2(x) = \Phi(3x)$, $\Phi$ is the standard normal
distribution function, and $U \sim N(0, 1)$. The values of $(X^1, X^2)$ are the
grid $(i/(n^{1/2} + 1), j/(n^{1/2} + 1))$, $i, j = 1, \ldots, n^{1/2}$, where $n$
is the sample size. The penalty term $J$ is defined with $\nu_1 = \nu_2 = 1$.
We used the B-spline approach described in Section 3. The estimates of
$m_1$, $m_2$ and $F$ are B-splines with four knots. There are 500 Monte Carlo
replications in each simulation.
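The data-generating design described above can be written down in a few lines. This is a sketch of the design only, not of the estimator; the function name `simulate`, the seed and the use of NumPy's default generator are illustrative assumptions.

```python
import math
import numpy as np

def simulate(n, seed=0):
    """Draw one sample of size n from the simulation design of this section."""
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError("n must be a perfect square")
    ticks = np.arange(1, root + 1) / (root + 1)          # i / (n^{1/2} + 1)
    x1, x2 = (g.ravel() for g in np.meshgrid(ticks, ticks, indexing="ij"))
    # Standard normal distribution function via the error function.
    Phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    m1 = np.sin(np.pi * x1)
    m2 = Phi(3.0 * x2)
    rng = np.random.default_rng(seed)                    # seed is an illustrative choice
    y = m1 + m2 + rng.standard_normal(n)                 # F is the identity, U ~ N(0, 1)
    return x1, x2, y, m1, m2

x1, x2, y, m1, m2 = simulate(400)
print(len(y), x1.min(), x1.max())
```

The sample sizes 400 and 900 reported in Table 1 are perfect squares, which is what the grid design requires.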

Table 1 shows the empirical integrated mean-square errors of $\hat m_1$,
$\hat m_2$ and $\hat F$ at three different values of the penalty parameter,
$\lambda$.

The simulation results with $\lambda = 0.10$ are shown graphically in
Figures 1 and 2. The wiggles in the estimates of $m_2$ are due to variance,
not bias. The 4-knot spline fits the true $m_2$ very well. In the
simulations our estimators show a very reliable performance.

Fig. 1. Performance of $\hat m_1$ (upper plot), $\hat m_2$ (middle plot) and
$\hat F$ (lower plot) with $n = 400$. The solid line is the true function; the
dashed line is the average of 500 estimates; circles, squares and triangles,
respectively, denote the estimates at the 25th, 50th and 75th percentiles of
the IMSE.

7. Conclusions and extensions. In this paper we have proposed an
estimation approach for a general class of nested regression models. The
basic idea is to use the following property of compositions of functions
belonging to certain smoothness classes: if the same entropy rate applies
for all smoothness classes, then the same entropy rate also applies to the
class of the composition of the functions. In our setting, the function
classes are subsets of additive Sobolev classes. The results could be
extended easily to other smoothness classes as long as entropy rates with
respect to the supremum norm are available. Examples are additive Sobolev
classes of functions with higher-dimensional arguments. Another point that
needs exploration is the case in which smoothness classes with different
entropy rates enter into the model. It would be interesting to check whether
each component's convergence rate is the one corresponding to the entropy
rate of its smoothness class. In particular, for parametric components it
would be important to check whether the component can be estimated with
rate $n^{-1/2}$. Furthermore, we conjecture that the resulting estimate is
efficient. Such a result has been proved in Mammen and van de Geer [27] for
a partial linear model with a known link function. There, penalized
quasi-likelihood estimation is used for the nonparametric components.
Another extension would be to apply our results for other classes of
M-estimators.

8. Proofs. 

Proof of Theorem 2.1. For a constant $c > 0$ consider the class of
functions

$$\mathcal{G} = \Bigl\{ F[m_1(x_1) + \cdots + m_d(x_d)] : |F(z)| \le c \text{ for } |z| \le d,\ m_j(0) = 0 \text{ for } j = 1, \dots, d,$$
$$\qquad \sum_{j=1}^d \int_0^1 m_j^{(k)}(x)^2\,dx + \sum_{j=1}^d \int_0^1 m_j'(x)^2\,dx = 1,\ J(F, m_1, \dots, m_d) \le 1 \Bigr\}.$$

First we will argue that, for a constant $C_k$,

(23) $H_B(\delta, \mathcal{G}, \|\cdot\|_\infty) \le C_k \delta^{-1/k}$

for $\delta > 0$. Here, $\|\cdot\|_\infty$ denotes the supremum norm. Furthermore,
$H_B(\delta, \mathcal{G}, \|\cdot\|_\infty)$ denotes the $\delta$-entropy with bracketing for the class
$\mathcal{G}$ w.r.t. the sup norm $\|\cdot\|_\infty$. This means that $\exp(H_B)$ is the smallest number $N$ for which
there exist pairs of functions $(g_1^L, g_1^U), \dots, (g_N^L, g_N^U)$ in $\mathcal{G}$ with the following
property. For each $g \in \mathcal{G}$ there exists $1 \le j \le N$ with $g_j^L \le g \le g_j^U$ and
$\|g_j^U - g_j^L\|_\infty \le \delta$. Such a set of tuples is also called a $\delta$-cover with bracketing.

This entropy bound follows from the following classical entropy bound on
Sobolev classes (see Birman and Solomjak [3] and van de Geer [40]):

(24) $H_B\Bigl(\delta, \Bigl\{ g : [0,1] \to \mathbb{R} : \|g\|_\infty \le 1,\ \int g^{(k)}(x)^2\,dx \le 1 \Bigr\}, \|\cdot\|_\infty \Bigr) \le C \delta^{-1/k}$

for a constant $C > 0$. We now show how (23) follows from (24). From (24)
one gets for the class of additive functions

$$\mathcal{G}_{add} = \Bigl\{ m_1(x_1) + \cdots + m_d(x_d) : \sum_{j=1}^d \int_0^1 m_j^{(k)}(x)^2\,dx + \sum_{j=1}^d \int_0^1 m_j'(x)^2\,dx \le 1,\ m_j(0) = 0 \Bigr\}$$

with a constant $C' > 0$

(25) $H_B(\delta, \mathcal{G}_{add}, \|\cdot\|_\infty) \le C' \delta^{-1/k}$.

We use here that $\int_0^1 m_j'(x)^2\,dx \le 1$ and $m_j(0) = 0$ implies that $\|m_j\|_\infty \le 1$.

Consider now a function $F[m_1(x_1) + \cdots + m_d(x_d)]$ that is an element of $\mathcal{G}$.
Suppose that $m_1, \dots, m_d$ are chosen such that $m_j(0) = 0$ for $j = 1, \dots, d$
and $\sum_{j=1}^d \int_0^1 m_j^{(k)}(x)^2\,dx + \sum_{j=1}^d \int_0^1 m_j'(x)^2\,dx = 1$. For such a representation
$J(F, m_1, \dots, m_d) \le 1$ implies $\int F^{(k)}(z)^2\,dz \le 1$. Because $|F(z)| \le c$ for $|z| \le d$,
this implies $|F'(z)| \le C''$ for $|z| \le d$ with a constant $C''$. This can be seen, for
example, by application of the interpolation inequality; see (42). Consider
now a $\delta$-cover with bracketing $(g_1^L, g_1^U), \dots, (g_N^L, g_N^U)$ of $\mathcal{G}_{add}$. Consider a fixed
function $F$ with $|F'| \le C''$. Then $[F(g_1^L) - C''\delta, F(g_1^L) + C''\delta], \dots, [F(g_N^L) - C''\delta, F(g_N^L) + C''\delta]$
is a $(2C''\delta)$-cover with bracketing of $F(\mathcal{G}_{add})$. By a slight
extension of this argument, we get (23).
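
The passage from a bracket of the additive class to a bracket of its image under a fixed $F$ is a mean value argument. As a sketch, assume that $g$ and $g_j^L$ take values in $[-d, d]$, that $g_j^L \le g \le g_j^U$ with $\|g_j^U - g_j^L\|_\infty \le \delta$, and that $|F'| \le C''$ on $[-d, d]$:

```latex
\begin{align*}
|F(g(x)) - F(g_j^L(x))|
  &\le \sup_{|z| \le d} |F'(z)| \, |g(x) - g_j^L(x)|
   \le C'' \, \|g_j^U - g_j^L\|_\infty \le C'' \delta, \\
\text{so that} \quad
F(g_j^L) - C''\delta &\le F(g) \le F(g_j^L) + C''\delta .
\end{align*}
```

The resulting bracket has sup-norm width $2C''\delta$, which is the constant appearing in the cover of $F(\mathcal{G}_{add})$.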

We now apply Theorem 10.2 in van de Geer [40] with the modifications
discussed before the theorem. This theorem implies (11) and (12). We now
verify the assumptions of Theorem 10.2 in van de Geer [40]. We have to
check for $\varepsilon > 0$ that, with probability larger than $1 - \varepsilon$, the function
$H(x) = \hat F^*[\hat m_1(x_1) + \cdots + \hat m_d(x_d)]$ is an element of $\mathcal{G}$ if $c$ is chosen large enough.
Here the function $\hat F^*$ is defined as $\hat F^*(z) = \hat F(z)/(1 + J(\hat F, \hat m_1, \dots, \hat m_d))$.
W.l.o.g. we can assume that

(26) $\sum_{j=1}^d \int \hat m_j^{(k)}(x)^2\,dx + \sum_{j=1}^d \int \hat m_j'(x)^2\,dx = 1$,

(27) $\hat m_j(0) = 0$ for $1 \le j \le d$.

It can be easily checked that $J(\hat F^*, \hat m_1, \dots, \hat m_d) \le 1$. Thus, for the proof of
$H \in \mathcal{G}$, it remains to check that

(28) $\sup_{|z| \le d} |\hat F^*(z)| = O_P(1)$.

We now show (28). Equations (26) and (27) imply that

$$\sup_{0 \le x_1, \dots, x_d \le 1} |\hat m_1(x_1) + \cdots + \hat m_d(x_d)| \le d.$$

Furthermore, because of $J(\hat F^*, \hat m_1, \dots, \hat m_d) \le 1$, these equations imply that
$\int \hat F^{*\prime}(z)^2\,dz \le 1$. This shows that

$$\sup_{|z|, |z'| \le d} |\hat F^*(z') - \hat F^*(z)| \le 2d.$$
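
The increment bound can be obtained from the Cauchy-Schwarz inequality. A short worked version, for $|z|, |z'| \le d$ and using only $\int \hat F^{*\prime}(t)^2\,dt \le 1$ and $d \ge 1$:

```latex
\[
|\hat F^*(z') - \hat F^*(z)|
  = \Bigl| \int_z^{z'} \hat F^{*\prime}(t)\,dt \Bigr|
  \le |z' - z|^{1/2} \Bigl( \int \hat F^{*\prime}(t)^2\,dt \Bigr)^{1/2}
  \le (2d)^{1/2} \le 2d .
\]
```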



We now show that

(29) $\inf_{|z| \le d} |\hat F^*(z)| = O_P(1)$.


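The last two bounds imply (28). Thus, it remains to show (29). For the proof
of (29), note first that by definition of $\hat F, \hat m_1, \dots, \hat m_d$ the following inequality
holds with $\bar F(z) \equiv \bar Y = n^{-1} \sum_{i=1}^n Y_i$ and $\hat Z_i = \hat m_1(X_i^1) + \cdots + \hat m_d(X_i^d)$:

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n \{Y_i - \hat F[\hat Z_i]\}^2
 &\le \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat F[\hat Z_i]\}^2 + \lambda_n^2 J(\hat F, \hat m_1, \dots, \hat m_d) \\
 &\le \frac{1}{n} \sum_{i=1}^n \{Y_i - \bar F[\hat Z_i]\}^2 + \lambda_n^2 J(\bar F, \hat m_1, \dots, \hat m_d) \\
 &\le \frac{1}{n} \sum_{i=1}^n Y_i^2 = O_P(1).
\end{aligned}$$

This implies $\inf_{|z| \le d} |\hat F(z)| = O_P(1)$ because of

$$\Bigl| \bar Y - \frac{1}{n} \sum_{i=1}^n \hat F[\hat Z_i] \Bigr|^2
 = \Bigl| \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat F[\hat Z_i]\} \Bigr|^2
 \le \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat F[\hat Z_i]\}^2 .$$

Claim (29) now follows because of $|\hat F^*| \le |\hat F|$. □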

Proof of Theorem 2.2. For the proof of Theorem 2.2, it remains to 
show (13). This claim immediately follows from Lemma 5.16 in van de Geer 
[40]. □ 

Proof of Proposition 3.1. Without loss of generality, we assume
that the functions $m_1$ and $m_2$ are nonconstant. Then there exist $x_1^* \in A_1$
and $x_2^* \in A_2$ with $m_1'(x_1^*) \ne 0$ and $m_2'(x_2^*) \ne 0$. For
$H(x) = F[m_1(x_1) + \cdots + m_d(x_d)] = G[\mu_1(x_1) + \cdots + \mu_d(x_d)]$, we get that
$\frac{\partial}{\partial x_1} H(x) \ne 0$ if $x_1 = x_1^*$ and
$\frac{\partial}{\partial x_2} H(x) \ne 0$ if $x_2 = x_2^*$. For $x_1 \in A_1, \dots, x_d \in A_d$, put
$x^* = (x_1^*, x_2, \dots, x_d)'$ and $x^{**} = (x_1, x_2^*, x_3, \dots, x_d)'$. Then for $2 \le j \le d$,

$$\frac{m_j'(x_j)}{m_1'(x_1^*)} = \frac{\frac{\partial}{\partial x_j} H(x^*)}{\frac{\partial}{\partial x_1} H(x^*)} = \frac{\mu_j'(x_j)}{\mu_1'(x_1^*)}.$$

Because of $\int_{A_j} m_j(x_j)\,dx_j = \int_{A_j} \mu_j(x_j)\,dx_j = 0$, this gives for $2 \le j \le d$

(30) $\dfrac{m_j(x_j)}{m_1'(x_1^*)} = \dfrac{\mu_j(x_j)}{\mu_1'(x_1^*)}$.

Using partial derivatives of $H$ at $x = x^{**}$, we get

(31) $\dfrac{m_1(x_1)}{m_2'(x_2^*)} = \dfrac{\mu_1(x_1)}{\mu_2'(x_2^*)}$.

Equation (31) implies that $m_1'(x_1^*)/m_2'(x_2^*) = \mu_1'(x_1^*)/\mu_2'(x_2^*)$. This
shows that (30) holds for $1 \le j \le d$. Because of $\sum_{j=1}^d \int_{A_j} m_j(x_j)^2\,dx_j =
\sum_{j=1}^d \int_{A_j} \mu_j(x_j)^2\,dx_j$, this implies the statements of the proposition. □
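
The key observation in this proof is that ratios of partial derivatives of $H$ do not depend on the link function. The following numerical sketch illustrates this with finite differences. The component functions, the evaluation point and the three link functions are hypothetical choices made for illustration (they are not centered as in the proposition, which does not matter for the ratio).

```python
import numpy as np

def m1(x):
    return np.sin(x)          # hypothetical first component

def m2(x):
    return x ** 2             # hypothetical second component

def partial(H, x1, x2, j, h=1e-5):
    """Central finite difference of H with respect to x_j (j = 1 or 2)."""
    if j == 1:
        return (H(x1 + h, x2) - H(x1 - h, x2)) / (2 * h)
    return (H(x1, x2 + h) - H(x1, x2 - h)) / (2 * h)

def ratio(F, x1_star, x2):
    """H_2 / H_1 at (x1_star, x2) for H(x) = F[m1(x1) + m2(x2)]."""
    def H(a, b):
        return F(m1(a) + m2(b))
    return partial(H, x1_star, x2, 2) / partial(H, x1_star, x2, 1)

x1_star, x2 = 0.3, 0.7
target = (2 * x2) / np.cos(x1_star)            # m2'(x2) / m1'(x1_star)
for F in (np.tanh, np.exp, lambda z: z + z ** 3):
    print(ratio(F, x1_star, x2), target)       # the two numbers agree closely
```

All three link functions give the same ratio up to discretization error, which is what allows the additive components to be identified up to scale without knowing $F$.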

Proof of Theorem 3.2. We first show (14). Put
$\hat H(x_1, \dots, x_d) = \hat F[\hat m_1(x_1) + \cdots + \hat m_d(x_d)]$ and
$H(x_1, \dots, x_d) = F[m_1(x_1) + \cdots + m_d(x_d)]$. We
write $\hat H_j = \partial_{x_j} \hat H$, $H_j = \partial_{x_j} H$, $\hat H_{ij} = \partial_{x_j} \partial_{x_i} \hat H$ and
$H_{ij} = \partial_{x_j} \partial_{x_i} H$ for the partial derivatives of $\hat H$ and $H$.

For $1 \le j \le d$, define $\tilde m_j(x_j) = \gamma^{-1}[\hat m_j(x_j) - \hat m_j(0)]$ with
$\gamma^2 = \sum_{j=1}^d \int \hat m_j^{(k)}(x)^2\,dx + \int \hat m_j'(x)^2\,dx$. Furthermore, choose $\tilde F$ so that
$\tilde F[\tilde m_1(x_1) + \cdots + \tilde m_d(x_d)] = \hat F[\hat m_1(x_1) + \cdots + \hat m_d(x_d)]$.

Then $\tilde m_1, \dots, \tilde m_d$ satisfy (26) and (27) with $\hat m_j$ replaced by $\tilde m_j$ and we
have that

(32) $\int_0^1 \tilde m_j'(x_j)^2\,dx_j = O_P(1)$,

(33) $\int_0^1 \tilde m_j^{(k)}(x_j)^2\,dx_j = O_P(1)$

for $j = 1, \dots, d$. Note also that $\tilde m_j(0) = 0$ by definition.

By Sobolev embedding results (see, e.g., Section VI.7 in Yosida [42] or
Oden and Reddy [32]), the bounds (32) and (33) give

(34) $\sup_{x_j \in [0,1]} |\tilde m_j^{(l)}(x_j)| = O_P(1)$

for $j = 1, \dots, d$ and $0 \le l \le k - 1$. We now derive a similar bound for the link
function $\tilde F$.

From (29) and (12) one gets that $\inf_{|z| \le d} |\tilde F(z)| = O_P(1)$. From Theorem
2.1 we get that

(35) $\int \tilde F^{(k)}(z)^2\,dz = O_P(1)$.

By application of the Sobolev embedding, this shows that

(36) $\sup_{|z| \le d} |\tilde F^{(l)}(z)| = O_P(1)$

for $0 \le l \le k - 1$.

The rest of the proof is divided into several steps. 

Step 1. In this step we argue that

(37) $\int_{[0,1]^d} \hat H^*(x)^2\,dx = O_P(1)$,

where $\hat H^*$ is a partial derivative of $\hat H$ of order $k$. The integral $\int \hat H^*(x)^2\,dx$
can be easily bounded by a sum of integrals over products of derivatives of
$\tilde F$, $\tilde m_1$ or ... or $\tilde m_d$, respectively. Most summands can be easily bounded by
using (33)-(36). One summand needs a little bit more care, namely,

$$\int_{[0,1]^d} \tilde F^{(k)}(\tilde m_1(x_1) + \cdots + \tilde m_d(x_d))^2\, \tilde m_{i_1}'(x_{i_1})^2 \cdots \tilde m_{i_k}'(x_{i_k})^2\,dx.$$

This term arises when $\hat H^*$ is a partial derivative w.r.t. $x_{i_1}, \dots, x_{i_k}$. Up to a
factor that is stochastically bounded, this integral is equal to

(38) $\int_{[0,1]^d} \tilde F^{(k)}(\tilde m_1(x_1) + \cdots + \tilde m_d(x_d))^2\, \tilde m_{i_1}'(x_{i_1})^2\,dx$;

see also (34). We now apply that for two functions $g : [0,1] \to [a,b]$, $f : [a,b] \to \mathbb{R}$
with $a < b$ the following inequality holds:

(39) $\int_0^1 f[g(y)]^2 g'(y)^2\,dy \le 2 \Bigl[ \int_0^1 g''(y)^2\,dy \Bigr]^{1/2} \int_a^b f(z)^2\,dz$.

By using (39) with $f = \tilde F^{(k)}$ and $g = \tilde m_{i_1} + \text{const.}$, one can easily check that
the integral in (38) is bounded by

$$O_P(1) \cdot \int_{-d}^{d} \tilde F^{(k)}(z)^2\,dz.$$

This quantity is stochastically bounded because of (35). For the proof of
(37), it remains to prove (39). For the proof of this inequality, we denote for
$u < v$ by $k(u, v)$ the number of crossings of the interval $[u, v]$ by the function
$g'$. It can be easily checked that

$$\int_{I_{u,v}} |g''(y)|\,dy \ge (v - u)\,k(u, v),$$

where

$$I_{u,v} = \{ y \in [0,1] : u \le g'(y) \le v \}.$$

Choose now $c_i = 2^{-i}$. The claim (39) now follows from

$$\begin{aligned}
\int_0^1 f[g(y)]^2 g'(y)^2\,dy
 &= \int_{\{y : g'(y) \ne 0\}} f[g(y)]^2 g'(y)^2\,dy \\
 &= \sum_{i=-\infty}^{\infty} \int_{I_{c_i, c_{i-1}}} f[g(y)]^2 g'(y)^2\,dy
  + \sum_{i=-\infty}^{\infty} \int_{I_{-c_{i-1}, -c_i}} f[g(y)]^2 g'(y)^2\,dy \\
 &\le \sum_{i=-\infty}^{\infty} [k(c_i, c_{i-1}) + k(-c_{i-1}, -c_i)]\, c_{i-1} \int_a^b f(z)^2\,dz \\
 &\le \sum_{i=-\infty}^{\infty} \frac{c_{i-1}}{c_{i-1} - c_i} \int_{I_{c_i, c_{i-1}} \cup I_{-c_{i-1}, -c_i}} |g''(y)|\,dy \int_a^b f(z)^2\,dz \\
 &\le 2 \int_0^1 |g''(y)|\,dy \int_a^b f(z)^2\,dz \\
 &\le 2 \Bigl[ \int_0^1 g''(y)^2\,dy \Bigr]^{1/2} \int_a^b f(z)^2\,dz.
\end{aligned}$$

Step 2. We now show that



(40) $\int [\hat H_j(x_1, \dots, x_d) - H_j(x_1, \dots, x_d)]^2\,dx = O_P(n^{-2(k-1)/(2k+1)})$,

(41) $\int [\hat H_{ij}(x_1, \dots, x_d) - H_{ij}(x_1, \dots, x_d)]^2\,dx = O_P(n^{-2(k-2)/(2k+1)})$

for $1 \le i, j \le d$.

For the proof of these claims, we make use of the interpolation inequality
of Agmon [2]; see also van de Geer ([40], Lemma 10.8) and Mammen and
Thomas-Agnan [26]. This inequality states that for a function $g : \mathbb{R} \to \mathbb{R}$ and
a real number $\theta > 0$ it holds that

(42) $\int g^{(l)}(x)^2\,dx \le c\,\theta^{-2l} \int g(x)^2\,dx + c\,\theta^{2(k-l)} \int g^{(k)}(x)^2\,dx$

for a constant $c$ and $1 \le l \le k$. The claims (40) and (41) follow from the
bound on $\hat H - H$ in Theorem 2.1, (37) and the interpolation inequality.
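
The rates in (40) and (41) come from balancing the two terms of the interpolation inequality. As a sketch, assume a function $g$ with $\int g(x)^2\,dx = O_P(n^{-2k/(2k+1)})$ and $\int g^{(k)}(x)^2\,dx = O_P(1)$ (these play the roles of the bound from Theorem 2.1 and of (37), applied coordinatewise, which is an assumption of this illustration). Choosing $\theta = n^{-1/(2k+1)}$ gives

```latex
\begin{align*}
\int g^{(l)}(x)^2\,dx
  &\le c\,\theta^{-2l}\,O_P\bigl(n^{-2k/(2k+1)}\bigr) + c\,\theta^{2(k-l)}\,O_P(1) \\
  &= O_P\bigl(n^{(2l-2k)/(2k+1)}\bigr) + O_P\bigl(n^{-2(k-l)/(2k+1)}\bigr)
   = O_P\bigl(n^{-2(k-l)/(2k+1)}\bigr).
\end{align*}
```

Taking $l = 1$ reproduces the exponent in (40) and $l = 2$ the exponent in (41).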

Step 3. According to (A7), two additive functions are not constant.
W.l.o.g. we assume that this is the case for the first two functions. Then
there exist constants $0 \le a_1 < b_1 \le 1$ and $0 \le a_2 < b_2 \le 1$ with

$$\inf_{a_j \le x_j \le b_j} |m_j'(x_j)| > 0 \quad \text{for } j = 1, 2.$$

In this step we show that uniformly for $0 \le x_1 \le 1$ it holds that

(43) $\hat\rho\,\tilde m_1'(x_1) = \rho\,m_1'(x_1) + o_P(1)$,

where

$$\hat\rho = \int_{a_1}^{b_1} \frac{1}{\tilde m_1'(x_1)}\,dx_1, \qquad
\rho = \int_{a_1}^{b_1} \frac{1}{m_1'(x_1)}\,dx_1.$$

For the proof of (43) note first that (40)-(41) imply that there exist random
$0 \le x_3^*, \dots, x_d^* \le 1$, $a_2 \le x_2^* \le b_2$ with

(44) $\int [\hat H_j(x_1, x_2^*, \dots, x_d^*) - H_j(x_1, x_2^*, \dots, x_d^*)]^2\,dx_1 = O_P(n^{-2(k-1)/(2k+1)})$,

(45) $\int [\hat H_{j,1}(x_1, x_2^*, \dots, x_d^*) - H_{j,1}(x_1, x_2^*, \dots, x_d^*)]^2\,dx_1 = O_P(n^{-2(k-2)/(2k+1)})$

for $j = 1$ and $j = 2$. We now argue that for a (random) function $\Delta : [0,1] \to \mathbb{R}$
the following implication holds. If $\int \Delta'(u)^2\,du = O_P(1)$ and $\int \Delta(u)^2\,du =
o_P(1)$, then it holds that $\sup |\Delta(u)| = o_P(1)$. This implication can be easily
verified by using that $\int \Delta'(u)^2\,du = O_P(1)$ implies that

$$\sup_{u \ne v} \frac{|\Delta(u) - \Delta(v)|}{|u - v|^{1/2}} = O_P(1).$$

The latter implication follows by application of an embedding theorem (see
Adams [1], page 97) or directly by a simple calculation.
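
One version of the simple calculation runs as follows. It is a sketch for a continuous function $\Delta$ on $[0,1]$ with $M = \int_0^1 \Delta'(u)^2\,du > 0$ (if $M = 0$, then $\Delta$ is constant and the claim is immediate); the symbols $M$, $s$, $u_0$ and $r$ are introduced only for this illustration.

```latex
\begin{align*}
&\text{Let } s = \sup_{0 \le u \le 1} |\Delta(u)| = |\Delta(u_0)| . \text{ By the Cauchy--Schwarz inequality,} \\
&\qquad |\Delta(v) - \Delta(u_0)| \le |v - u_0|^{1/2} M^{1/2} \le s/2
   \quad \text{whenever } |v - u_0| \le r := s^2/(4M), \\
&\text{so } |\Delta(v)| \ge s/2 \text{ on a subset of } [0,1] \text{ of length at least } \min(r, 1). \text{ Hence} \\
&\qquad \int_0^1 \Delta(u)^2\,du \ge \frac{s^2}{4} \min\Bigl( \frac{s^2}{4M}, 1 \Bigr)
   \;\Longrightarrow\;
   s^2 \le \max\Bigl\{ 4 \int_0^1 \Delta(u)^2\,du,\; 4 \Bigl( M \int_0^1 \Delta(u)^2\,du \Bigr)^{1/2} \Bigr\}.
\end{align*}
```

With $M = O_P(1)$ and $\int_0^1 \Delta(u)^2\,du = o_P(1)$ the right-hand side is $o_P(1)$.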

We now apply this result for $j = 1$ and $j = 2$ with
$\Delta(x_1) = \hat H_j(x_1, x_2^*, \dots, x_d^*) - H_j(x_1, x_2^*, \dots, x_d^*)$. This gives

$$\sup_{0 \le x_1 \le 1} |\hat H_j(x_1, x_2^*, \dots, x_d^*) - H_j(x_1, x_2^*, \dots, x_d^*)| = o_P(1).$$

We now apply this expansion and make use of the fact that $|m_1'|(u)$ (for
$u \in [a_1, b_1]$), $|m_2'|(u)$ (for $u \in [a_2, b_2]$) and $F'$ are bounded away from zero
and from infinity. We get the following expansions that hold uniformly for
$0 \le x_1 \le 1$ and $a_1 \le x_1' \le b_1$:

$$\frac{m_2'(x_2^*)}{m_1'(x_1')}
 = \frac{H_2(x_1', x_2^*, \dots, x_d^*)}{H_1(x_1', x_2^*, \dots, x_d^*)}
 = \frac{\hat H_2(x_1', x_2^*, \dots, x_d^*)}{\hat H_1(x_1', x_2^*, \dots, x_d^*)} + o_P(1)
 = \frac{\tilde m_2'(x_2^*)}{\tilde m_1'(x_1')} + o_P(1),$$

$$\frac{m_1'(x_1)}{m_2'(x_2^*)} = \frac{\tilde m_1'(x_1)}{\tilde m_2'(x_2^*)} + o_P(1).$$

This implies that uniformly for $0 \le x_1 \le 1$ and $a_1 \le x_1' \le b_1$,

$$\frac{\tilde m_1'(x_1)}{\tilde m_1'(x_1')} = \frac{m_1'(x_1)}{m_1'(x_1')} + o_P(1).$$

Claim (43) now follows by integrating both sides of the last equality w.r.t.
$x_1'$.

Step 4. In this step we show that for $2 \le j \le d$ and for random sequences
$\delta_{j,n}$

(46) $\int_0^1 |\tilde m_j(x_j) - \hat\rho^{-1}\rho[m_j(x_j) - m_j(0)] - \delta_{j,n}|^2\,dx_j = O_P(n^{-2k/(2k+1)})$.

For the proof we note first that (40) and the bound on $\hat H - H$ in Theorem
2.1 imply that there exist random numbers $0 \le x_2^*, \dots, x_{j-1}^*, x_{j+1}^*, \dots, x_d^* \le 1$
with

(47) $\int [\hat H - H]^2(x_1, x_2^*, \dots, x_{j-1}^*, x_j, x_{j+1}^*, \dots, x_d^*)\,dx_1\,dx_j = O_P(n^{-2k/(2k+1)})$,

(48) $\int [\hat H_1 - H_1]^2(x_1, x_2^*, \dots, x_{j-1}^*, x_j, x_{j+1}^*, \dots, x_d^*)\,dx_1\,dx_j = O_P(n^{-2(k-1)/(2k+1)})$,

(49) $\int [\hat H_j - H_j]^2(x_1, x_2^*, \dots, x_{j-1}^*, x_j, x_{j+1}^*, \dots, x_d^*)\,dx_1\,dx_j = O_P(n^{-2(k-1)/(2k+1)})$.

In the following calculations of this step we fix the random vector
$(x_2^*, \dots, x_{j-1}^*, x_{j+1}^*, \dots, x_d^*)$ and, for simplicity of notation, we write $f(x_1, x_j)$ instead of
$f(x_1, x_2^*, \dots, x_{j-1}^*, x_j, x_{j+1}^*, \dots, x_d^*)$ for the functions $f = H, H_1, H_j, \hat H, \hat H_1$ or $\hat H_j$,
respectively. We now use that

(50) $\int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1\, \dfrac{\hat H_j(x_1, u_j)}{\hat H_1(x_1, u_j)} = \int_{a_1}^{b_1} \dfrac{1}{\tilde m_1'(x_1)}\,dx_1\,[\tilde m_j(x_j) - \tilde m_j(0)] = \hat\rho\,\tilde m_j(x_j)$

and that

(51) $\int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1\, \dfrac{H_j(x_1, u_j)}{H_1(x_1, u_j)} = \int_{a_1}^{b_1} \dfrac{1}{m_1'(x_1)}\,dx_1\,[m_j(x_j) - m_j(0)] = \rho\,[m_j(x_j) - m_j(0)]$.
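
Identity (51) says that integrating the ratio $H_j/H_1$ recovers $m_j$ up to the scale factor $\rho$, whatever the link function is. The following numerical sketch checks this with the trapezoidal rule. The link `tanh`, the component functions, the interval `[a1, b1]` and all helper names are hypothetical choices for illustration; partial derivatives are obtained from the chain rule rather than estimated.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule along the last axis of y."""
    y = np.asarray(y, dtype=float)
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1)

def F_prime(z):
    return 1.0 / np.cosh(z) ** 2             # derivative of the link F = tanh

def m1(x):
    return np.sin(np.pi * x)

def m1_prime(x):
    return np.pi * np.cos(np.pi * x)

def mj(x):
    return x ** 2 - 1.0 / 3.0                # integrates to zero over [0, 1]

def mj_prime(x):
    return 2.0 * x

a1, b1 = 0.0, 0.25                           # m1' is bounded away from zero here
x1 = np.linspace(a1, b1, 2001)

def lhs(xj, n_u=2001):
    """int_0^{xj} du_j int_{a1}^{b1} dx_1 H_j(x1, u_j) / H_1(x1, u_j)."""
    u = np.linspace(0.0, xj, n_u)
    z = m1(x1)[None, :] + mj(u)[:, None]     # argument of F on the (u_j, x_1) grid
    H1 = F_prime(z) * m1_prime(x1)[None, :]  # chain rule
    Hj = F_prime(z) * mj_prime(u)[:, None]
    inner = trapezoid(Hj / H1, x1)           # integrate over x_1 for each u_j
    return trapezoid(inner, u)

rho = trapezoid(1.0 / m1_prime(x1), x1)
xj = 0.8
print(lhs(xj), rho * (mj(xj) - mj(0.0)))     # the two values agree
```

The link function cancels in the ratio, so replacing `F_prime` by the derivative of any other strictly monotone link leaves the output unchanged.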
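Furthermore, we make use of the expansion

$$\frac{1}{\hat H_1} = \frac{1}{H_1} - \frac{\hat H_1 - H_1}{H_1^2} + \frac{(\hat H_1 - H_1)^2}{H_1^2 \hat H_1}.$$

This gives the expansion

(52)
$$\begin{aligned}
\tilde m_j(x_j) - \hat\rho^{-1}\rho[m_j(x_j) - m_j(0)]
 &= \hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1 \Bigl[ \frac{\hat H_j(x_1, u_j)}{\hat H_1(x_1, u_j)} - \frac{H_j(x_1, u_j)}{H_1(x_1, u_j)} \Bigr] \\
 &= \hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1 \Bigl[ \frac{\hat H_j - H_j}{H_1} - \frac{(\hat H_1 - H_1)(\hat H_j - H_j)}{H_1^2} \\
 &\qquad\qquad - H_j \frac{\hat H_1 - H_1}{H_1^2} + \frac{\hat H_j (\hat H_1 - H_1)^2}{H_1^2 \hat H_1} \Bigr](x_1, u_j).
\end{aligned}$$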

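Because of (34), it holds that $\hat\rho^{-1} = O_P(1)$. This bound together with
(48)-(49) implies for the second term in (52) the bound

(53) $\hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1\, \dfrac{(\hat H_1 - H_1)(\hat H_j - H_j)}{H_1^2}(x_1, u_j) = O_P(n^{-(2k-2)/(2k+1)}) = O_P(n^{-k/(2k+1)})$.

For estimating the last term in (52), we use that

$$\sup_{a_1 \le x_1 \le b_1,\, 0 \le x_j \le 1} \Bigl| \hat\rho^{-1} \frac{\hat H_j}{\hat H_1}(x_1, x_j) \Bigr|
 = \sup_{a_1 \le x_1 \le b_1,\, 0 \le x_j \le 1} \Bigl| \hat\rho^{-1} \frac{\tilde m_j'(x_j)}{\tilde m_1'(x_1)} \Bigr| = O_P(1),$$

because of (40) and because $\inf_{a_1 \le x_1 \le b_1} |\hat\rho\,\tilde m_1'(x_1)| > c$ with probability
tending to one for a constant $c > 0$ small enough. The latter fact follows directly
from (43) and (A6). With this bound, we get for the last term in (52)

(54) $\hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1\, \dfrac{\hat H_j (\hat H_1 - H_1)^2}{H_1^2 \hat H_1}(x_1, u_j) = O_P(1) \int_0^1 du_j \int_{a_1}^{b_1} dx_1\, (\hat H_1 - H_1)^2(x_1, u_j) = O_P(n^{-k/(2k+1)})$,

where again (48) was applied. Using (52)-(54), we get uniformly for
$0 \le x_j \le 1$ that

$$\begin{aligned}
\tilde m_j(x_j) - \hat\rho^{-1}\rho[m_j(x_j) - m_j(0)]
 &= \hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1 \Bigl[ \frac{\hat H_j - H_j}{H_1} \Bigr](x_1, u_j) \\
 &\quad - \hat\rho^{-1} \int_0^{x_j} du_j \int_{a_1}^{b_1} dx_1\, H_j \Bigl[ \frac{\hat H_1 - H_1}{H_1^2} \Bigr](x_1, u_j) + O_P(n^{-k/(2k+1)}) \\
 &= T_1(x_j) + T_2(x_j) + O_P(n^{-k/(2k+1)}).
\end{aligned}$$

We now apply partial integration for the first term. This gives

$$\begin{aligned}
T_1(x_j) &= \hat\rho^{-1} \int_{a_1}^{b_1} dx_1 \int_0^{x_j} du_j\, \frac{1}{H_1(x_1, u_j)} \{\hat H_j(x_1, u_j) - H_j(x_1, u_j)\} \\
 &= -\hat\rho^{-1} \int_{a_1}^{b_1} dx_1\, \frac{1}{H_1(x_1, 0)} \{\hat H(x_1, 0) - H(x_1, 0)\} \\
 &\quad + \hat\rho^{-1} \int_{a_1}^{b_1} dx_1\, \frac{1}{H_1(x_1, x_j)} \{\hat H(x_1, x_j) - H(x_1, x_j)\} \\
 &\quad + \hat\rho^{-1} \int_{a_1}^{b_1} dx_1 \int_0^{x_j} du_j\, \frac{H_{1,j}(x_1, u_j)}{H_1(x_1, u_j)^2} \{\hat H(x_1, u_j) - H(x_1, u_j)\} \\
 &= \delta_{j,1,n} + g_{j,1,n}(x_j) + g_{j,2,n}(x_j),
\end{aligned}$$

with a real random sequence $\delta_{j,1,n}$. The random functions $g_{j,1,n}$ and $g_{j,2,n}$
satisfy

$$\int g_{j,1,n}(x_j)^2\,dx_j = O_P(n^{-2k/(2k+1)}),$$

$$\int g_{j,2,n}(x_j)^2\,dx_j = O_P(n^{-2k/(2k+1)}).$$

Similarly, we get that

$$T_2(x_j) = \delta_{j,2,n} + g_{j,3,n}(x_j) + g_{j,4,n}(x_j)$$

with a real random sequence $\delta_{j,2,n}$ and random functions $g_{j,3,n}$ and $g_{j,4,n}$
that satisfy

$$\int g_{j,3,n}(x_j)^2\,dx_j = O_P(n^{-2k/(2k+1)}),$$

$$\int g_{j,4,n}(x_j)^2\,dx_j = O_P(n^{-2k/(2k+1)}).$$

This shows that, for $2 \le j \le d$,

$$\int |\tilde m_j(x_j) - \hat\rho^{-1}\rho[m_j(x_j) - m_j(0)] - \delta_{j,1,n} - \delta_{j,2,n}|^2\,dx_j = O_P(n^{-2k/(2k+1)}).$$

This implies (46).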

Step 5. In this step we show that there exists a random sequence $\delta_{1,n}$
such that

(55) $\int_0^1 |\tilde m_1(x_1) - \hat\rho^{-1}\rho[m_1(x_1) - m_1(0)] - \delta_{1,n}|^2\,dx_1 = O_P(n^{-2k/(2k+1)})$.

For this purpose we choose a function $s : [0,1] \to \mathbb{R}$ that has a continuous
derivative and satisfies $s(0) = s(1) = 0$ and $\int s(x_2) m_2'(x_2)\,dx_2 = 1$. Put
$w(x_2) = s(x_2) m_2'(x_2) \tilde m_2'(x_2)$. One can easily check that

$$\tilde m_1(x_1) = \int_0^{x_1} du_1 \int_0^1 dx_2\, w(x_2) \frac{\hat H_1(u_1, x_2)}{\hat H_2(u_1, x_2)},$$

where $\hat H$ is defined as in the last step for $j = 2$. We define

$$m_1^*(x_1) = \int_0^{x_1} du_1 \int_0^1 dx_2\, w(x_2) \frac{H_1(u_1, x_2)}{H_2(u_1, x_2)}.$$

Proceeding as above, one can show that there exists a random sequence $\delta_{1,n}$
such that

$$\int_0^1 |\tilde m_1(x_1) - m_1^*(x_1) - \delta_{1,n}|^2\,dx_1 = O_P(n^{-2k/(2k+1)}).$$

In particular, the proof makes use of the following facts: $\sup_{x_2} |w(x_2)| =
O_P(1)$, $\sup_{x_1, \dots, x_d} |w(x_2)[H_1/H_2](x_1, x_2, \dots, x_d)| = O_P(1)$ and $\int w'(x_2)^2\,dx_2 =
O_P(1)$. For the proof of (55), it remains to show that

$$\int_0^1 |m_1^*(x_1) - \hat\rho^{-1}\rho[m_1(x_1) - m_1(0)]|^2\,dx_1 = O_P(n^{-2k/(2k+1)}).$$

Because of

$$m_1^*(x_1) = [m_1(x_1) - m_1(0)] \int_0^1 s(x_2) \tilde m_2'(x_2)\,dx_2,$$

this follows from

$$\begin{aligned}
\int_0^1 s(x_2) \tilde m_2'(x_2)\,dx_2
 &= s(x_2) \tilde m_2(x_2) \Big|_0^1 - \int_0^1 s'(x_2) \tilde m_2(x_2)\,dx_2 \\
 &= -\int_0^1 s'(x_2) \hat\rho^{-1}\rho\, m_2(x_2)\,dx_2 + O_P(n^{-k/(2k+1)}) \\
 &= \int_0^1 s(x_2) \hat\rho^{-1}\rho\, m_2'(x_2)\,dx_2 + O_P(n^{-k/(2k+1)}) \\
 &= \hat\rho^{-1}\rho + O_P(n^{-k/(2k+1)}).
\end{aligned}$$

Here, (46) for $j = 2$ was used. Thus, (55) is proved.

Step 6. In this step we show that

(56) $\hat\rho = O_P(1)$.

Arguing as above, we find $x_2^*, \dots, x_d^*$ such that

$$\sup_{0 \le x_1 \le 1} |\hat H(x_1, x_2^*, \dots, x_d^*) - H(x_1, x_2^*, \dots, x_d^*)| = o_P(1).$$

The claim now directly follows from

$$\begin{aligned}
0 &< \inf_z F'(z)\, |m_1(b_1) - m_1(a_1)| \\
 &\le |F[m_1(b_1) + m_2(x_2^*) + \cdots + m_d(x_d^*)] - F[m_1(a_1) + m_2(x_2^*) + \cdots + m_d(x_d^*)]| \\
 &= |\tilde F[\tilde m_1(b_1) + \tilde m_2(x_2^*) + \cdots + \tilde m_d(x_d^*)] - \tilde F[\tilde m_1(a_1) + \tilde m_2(x_2^*) + \cdots + \tilde m_d(x_d^*)]| + o_P(1) \\
 &\le \sup_z \tilde F'(z)\, |\tilde m_1(b_1) - \tilde m_1(a_1)| + o_P(1) \\
 &= O_P(1)\, \hat\rho^{-1}\rho\, |m_1(b_1) - m_1(a_1)| + o_P(1).
\end{aligned}$$
Step 7. In this step we show claim (14).

Using $\int m_j(x_j)\,dx_j = \int \hat m_j(x_j)\,dx_j = 0$, the definition of $\tilde m_j$, (46) and (55),
we get for $1 \le j \le d$

(57) $\int_0^1 |\gamma^{-1} \hat m_j(x_j) - \hat\rho^{-1}\rho\, m_j(x_j)|^2\,dx_j = O_P(n^{-2k/(2k+1)})$.

Here we have used that for a function $w$ it holds that $\int [w(x) - \int w(u)\,du]^2\,dx \le
\int w(x)^2\,dx$.

We now use that for a constant $a > 0$ and for two functions $w_1$ and $w_2$ with
$\int w_1(x)^2\,dx = \int w_2(x)^2\,dx = 1$, it holds that $\int [a w_1(x) - a^{-1} w_2(x)]^2\,dx \ge
\int [w_1(x) - w_2(x)]^2\,dx$. This shows

$$\begin{aligned}
\gamma^{-1} \hat\rho^{-1}\rho \sum_{j=1}^d \int_0^1 |\hat m_j(x_j) - m_j(x_j)|^2\,dx_j
 &\le \gamma^{-1} \hat\rho^{-1}\rho \sum_{j=1}^d \int_0^1 \Bigl| \sqrt{\gamma^{-1} \hat\rho \rho^{-1}}\, \hat m_j(x_j) - \sqrt{\gamma \hat\rho^{-1} \rho}\, m_j(x_j) \Bigr|^2\,dx_j \\
 &= \sum_{j=1}^d \int_0^1 |\gamma^{-1} \hat m_j(x_j) - \hat\rho^{-1}\rho\, m_j(x_j)|^2\,dx_j \\
 &= O_P(n^{-2k/(2k+1)}).
\end{aligned}$$

Furthermore, because of (56) and (34), we have $\hat\rho \rho^{-1} = O_P(1)$ and $\hat\rho^{-1}\rho =
O_P(1)$. With (57), this gives $\gamma^{-2} = \hat\rho^{-2}\rho^2 + o_P(1)$ and, thus, $\gamma = O_P(1)$.
Therefore, the last inequality implies (14).
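
Both elementary facts used in this step can be verified in one line each. As a sketch, for functions on $[0,1]$, with $w_1$ and $w_2$ normalized to $\int w_1^2 = \int w_2^2 = 1$ and a constant $a > 0$:

```latex
\begin{align*}
\int_0^1 \Bigl[ w(x) - \int_0^1 w(u)\,du \Bigr]^2 dx
  &= \int_0^1 w(x)^2\,dx - \Bigl( \int_0^1 w(u)\,du \Bigr)^2
   \le \int_0^1 w(x)^2\,dx , \\
\int [a w_1 - a^{-1} w_2]^2\,dx - \int [w_1 - w_2]^2\,dx
  &= a^2 + a^{-2} - 2 = (a - a^{-1})^2 \ge 0 .
\end{align*}
```

In the second line the cross terms $-2 \int w_1 w_2\,dx$ are the same on both sides and cancel, so rescaling by $a$ and $a^{-1}$ can only increase the $L_2$ distance.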

Step 8. It remains to show (15). From (36) with $l = 1$ and (57), we get

(58) $\sup_{|z| \le d} |\hat F'(z)| = O_P(1)$.

Claim (15) immediately follows from (14), (58) and Theorem 2.2. □

Proof of Theorem 3.3. We give a proof of part (i). Part (ii) follows
by similar arguments. Suppose that $(\hat F, \hat m_1, \dots, \hat m_d)$ is an approximate
minimizer of (4) over $\mathcal{M}$ with $\hat m_{j_0}$ not necessarily a natural spline. Define
$\tilde m_{j_0}$ as minimizer of $\int p^{(k)}(u)^2\,du$ under the constraint $p(X_i^{j_0}) = \hat m_{j_0}(X_i^{j_0})$
for $1 \le i \le n$. The function $\tilde m_{j_0}$ is a natural spline of order $2k$ with knots
$X_1^{j_0}, \dots, X_n^{j_0}$; see, for example, Eubank [8]. We show that

(59) $\int \tilde m_{j_0}'(u)^2\,du = O_P(1)$.

This immediately shows that $(\hat F, \hat m_1, \dots, \hat m_{j_0 - 1}, \tilde m_{j_0}, \hat m_{j_0 + 1}, \dots, \hat m_d)$ is an
approximate minimizer of (4) over $\mathcal{M}$ and, thus, it implies the statement of
Theorem 3.2 (i). It remains to show (59). This follows from

(60) $\int \Delta'(u)^2\,du = O_P(1)$,

with $\Delta(u) = \tilde m_{j_0}(u) - \hat m_{j_0}(u)$. For the proof of (60) note that, by the Sobolev
embedding theorem, one can write $\Delta(z) = \Delta_1(z) + \Delta_2(z)$ with $\Delta_1$ a polynomial of
degree at most $k - 1$ with coefficients $\beta_1, \dots, \beta_k$
and $|\Delta_2(z)| \le [\int \Delta^{(k)}(z)^2\,dz]^{1/2} = O_P(1)$; see, for example, Oden and
Reddy [32]. Because of $\Delta(X_i^{j_0}) = 0$ for $1 \le i \le n$, we get that $\beta_1, \dots, \beta_k =
O_P(1)$. This implies $\int \Delta(u)^2\,du = O_P(1)$. Now (60) follows from the
interpolation inequality (42). □

Proof of Theorems 4.1 and 4.2. The theorems follow by similar 
arguments as in the proofs of Theorems 2.1 and 2.2. □ 

Proof of Theorem 5.1. The proof of the quantile version of Theorem
2.1 and Theorem 4.1 follows along the same lines as the original proofs. For
the necessary modifications to apply empirical process theory, see van de
Geer [39] and Chapter 12 in van de Geer [40]. Note, for example, that (for
a = 1/2) condition (A4') restates (12.22) and (12.23) in van de Geer [40].
Compare also Exercise 12.4 in van de Geer [40]. The quantile versions of
Theorem 2.2, Theorem 3.2 and Theorem 4.2 follow directly from the new
versions of Theorem 2.1 and Theorem 4.1 by the same arguments as in the
proofs of their original versions. □